CGG toolkit: Software components for computational genomics

Public-domain availability for bioinformatics software resources is a key requirement that ensures long-term permanence and methodological reproducibility for research and development across the life sciences. These issues are particularly critical for widely used, efficient, and well-proven methods, especially those developed in research settings that often face funding discontinuities. We re-launch a range of established software components for computational genomics, as legacy version 1.0.1, suitable for sequence matching, masking, searching, clustering and visualization for protein family discovery, annotation and functional characterization on a genome scale. These applications are made available online as open source and include MagicMatch, GeneCAST, support scripts for CoGenT-like sequence collections, GeneRAGE and DifFuse, supported by centrally administered bioinformatics infrastructure funding. The toolkit may also be conceived as a flexible genome comparison software pipeline that supports research in this domain. We illustrate basic use by examples and pictorial representations of the registered tools, which are further described with appropriate documentation files in the corresponding GitHub release.


Introduction
Genome sequence analysis represents one of the most fundamental elements of computational genomics. It supports structural, comparative and functional genomics, and forms the foundation upon which systematic structure/function prediction, classification and annotation of proteins is based [1]. In addition, it establishes genome-scale properties of species, their relationships and the mapping of encoded genomic components (such as gene loci or protein sequences and structures) to dynamic properties revealed by genome-scale experiments [2]. Finally, genome sequence analysis is used in taxonomy such as species phylogenies [3], genetics such as protein family discovery [4,5], and biochemistry such as metabolic pathway reconstructions [6].
In the past, we developed a series of algorithmic components and introduced their software implementations for use in large-scale genome sequence analysis [7]. The Computational Genomics Group (CGG) at the European Bioinformatics Institute (1996-2005) maintained a server with these key tools available during the years 1997-2008 (https://web.archive.org/web/*/cgg.ebi.ac.uk) at the URL cgg.ebi.ac.uk (aliased as genomes.org), until hardware and other changes forced the discontinuation of these services. In particular, these tools were available at the URL http://cgg.ebi.ac.uk/cgg/Services.html until 2008 (Fig 1), with a number of popular software modules offered either as downloadable source/binary files or as interactive, precomputed solutions (https://web.archive.org/web/20080105003605/http://cgg.ebi.ac.uk/cgg/Services.html). For the following decade or so, every effort was made to deliver those software components to users by responding to direct requests or maintaining services elsewhere, admittedly a sub-optimal solution, yet the only realistic alternative to ensure public access.
It is widely accepted that most URLs published in the recent literature have a limited lifecycle [8], due to discontinued financial support, movement of personnel, inability to scale up resource deployment, and other policy or infrastructure reasons [9]. Unfortunately, the unpredictable and often inadequate grant application review process [10,11] results in the abrupt discontinuation of important bioinformatics services with serious implications for interdisciplinary research [12]. Recent efforts to re-establish and distribute published software in the field have been strongly supported by ELIXIR, the principal bioinformatics infrastructure project across Europe [13]. With the creation of funding streams both at the European and national node levels, it has become possible to revive and thus re-distribute previously published services on new platforms not previously available.
Thanks to these developments, we are now re-launching the key software modules of CGG services into the public domain that we used in our work for comparative genomics while at the EBI and in the ensuing years elsewhere. These exclude pre-computed data collections based on large genomic computations that cannot be made available at present due to the increase in volume of genome information; yet, we illustrate how a number of those data collections might be reconstructed using our toolkit. To that end, wrappers and other control software for managing and integrating various data segments are provided, akin to a flexible computational pipeline that can be modified according to user needs and specifications, thus reviving key components for use by the wider community. This work has been made possible thanks to infrastructure funding by the project ELIXIR-GR.

Design & implementation
Here, we describe the software tools made available from the angle of usage, and not according to the order in which they were developed or published, guiding users to decide how they deploy the toolkit in various computational genomics projects. A chronological listing is reflected in the original publications (Table 1). All versions have been named v1.0.1, to avoid conflicts with the latest developments and forking of projects in subsequent work. We wish to maintain this version system for consistency and to better plan any software updates with the corresponding funding streams in the near future. We request that citations to software tools refer to both the original publication and this work to acknowledge the re-established availability. We describe the main components below.
MagicMatch is a sequence matching protocol based on the MD5 checksum for the detection of identical protein sequences [14]. The MD5 algorithm for message integrity generates fingerprints which are used as hash strings to map sequences across databases. It thus helps the mapping of entry identifiers across sequence collections, which can be a rather time-consuming and computationally complex process. MagicMatch was the first of its kind and follows a minimalist approach. Other, more complex and high-maintenance tools have been proposed [15]; compared to these, and for most practical purposes, MagicMatch is superior in speed and usability. An example of use is when whole-genome protein collections are mixed with annotated datasets, e.g. SwissProt [16], for quick annotation purposes as in the TRIBES database [17]: users will want to find which genome entries are present in the annotated dataset, a task that can be rapidly accomplished using MagicMatch during pre-processing of genome-scale protein sequence datasets.
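The checksum-based mapping described above can be illustrated with a minimal Python sketch. It is not the MagicMatch implementation: the function names, the toy identifiers and sequences, and the normalisation step (removing whitespace and upper-casing before hashing) are assumptions made for illustration only.

```python
import hashlib


def md5_fingerprint(sequence):
    """Return the MD5 hex digest of a protein sequence.

    Assumption: whitespace is removed and letters are upper-cased before
    hashing, so that formatting differences do not change the fingerprint.
    """
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


def index_by_fingerprint(records):
    """Map each fingerprint to the identifiers of (identifier, sequence) records."""
    index = {}
    for identifier, sequence in records:
        index.setdefault(md5_fingerprint(sequence), []).append(identifier)
    return index


def match_identical(collection_a, collection_b):
    """Return sorted (id_a, id_b) pairs whose sequences are identical."""
    index_b = index_by_fingerprint(collection_b)
    matches = []
    for id_a, seq_a in collection_a:
        for id_b in index_b.get(md5_fingerprint(seq_a), []):
            matches.append((id_a, id_b))
    return sorted(matches)


# Toy data: hypothetical identifiers and short made-up sequences.
genome = [("g1", "MKTAYIAKQR"), ("g2", "MDVFMKGLSK")]
annotated = [("sp|X1", "mdvfmkglsk"), ("sp|X2", "MSTNPKPQRK")]
print(match_identical(genome, annotated))  # prints [('g2', 'sp|X1')]
```

Because only fixed-length digests are compared, the cost of cross-indexing two collections grows roughly linearly with the number of sequences, which is the practical appeal of this kind of approach.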
GeneCAST is a tool for the sensitive detection and selective masking of low-complexity regions in protein sequences [18]. The algorithm is based on multiple-pass Smith-Waterman comparisons [19] of the query sequence against all possible (i.e. 20) homopolymers of amino acid residues with infinite gap penalties. The output comprises the masked query sequence that can be used for high-throughput sequence searches with increased sensitivity (fewer false negative hits) and specificity (fewer false positive hits), as well as the statistics and geometry of low-complexity regions. The above two software components are part of the pre-processing steps of the query section for large-scale genome comparisons that typically use as target the entire protein sequence complement encoded in the genome of the corresponding organisms (Fig 2). The next component section refers to the preparation of the target data collection, which generates consistent and tractable sequence identifiers with a few critical annotation strings encoded within a user-defined identifier space. Note that in the case of all-vs-all comparisons, target and query must be identical, a step that is one of the most expensive, computationally demanding parts of genome comparison. An exception to this identity rule might be that the query set is masked by GeneCAST while the target set is not, maintaining the original sequence information, so that targets can be equivalenced back to their source using MagicMatch.
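To make the homopolymer comparison more tangible, the following Python sketch shows a simplified version of the idea. With infinite gap penalties, a local alignment against a homopolymer cannot contain gaps, so the best alignment reduces to the highest-scoring contiguous segment of the query. The scoring values (+2 for a match, -1 for a mismatch), the threshold, the mask character and the function names are illustrative assumptions and do not reproduce the scoring scheme or output of GeneCAST itself.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def best_segment(sequence, residue, match=2, mismatch=-1):
    """Best ungapped local alignment of `sequence` against a homopolymer.

    Returns (score, start, end) with `end` exclusive. The match/mismatch
    scores are assumptions, not GeneCAST parameters.
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for i, aa in enumerate(sequence):
        if running <= 0:
            # A non-positive prefix cannot improve a local alignment: restart.
            running, start = 0, i
        running += match if aa == residue else mismatch
        if running > best[0]:
            best = (running, start, i + 1)
    return best


def mask_low_complexity(sequence, threshold=10, mask_char="X"):
    """Mask segments scoring at least `threshold` against any homopolymer.

    Each residue type is scanned repeatedly (multiple passes) until no
    remaining segment reaches the threshold.
    """
    seq = list(sequence.upper())
    regions = []
    for residue in AMINO_ACIDS:
        while True:
            score, start, end = best_segment(seq, residue)
            if score < threshold:
                break
            regions.append((residue, start, end, score))
            seq[start:end] = mask_char * (end - start)
    return "".join(seq), regions


# Made-up query with a glutamine-rich stretch.
masked, regions = mask_low_complexity("MKTQQQQQQQAQQQLLV")
print(masked)
print(regions)
```

The returned list of regions (residue type, coordinates and score) stands in for the statistics and geometry of low-complexity regions mentioned in the text; the masked string is what would be passed on to the search step.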
The Complete Genome Tracking (CoGenT) database was originally developed to transform an undisciplined identifier space of genome sequence collections into a highly consistent environment for both human interaction and programming convenience. By using an encoded identifier for genes and species, it aimed at reproducibility, scalability and accessibility [20]. Later, CoGenT was augmented with additional plug-in components as a three-tier system named CoGenT++ [21], where much of the work on the large-scale comparisons of genomic sequences [22], the quantification of gene gain and loss [23], the ancestral reconstructions of gene content [24] and the inference of the gene content of the Last Universal Common Ancestor [25] was based. Despite progress with hardware and software acceleration, CoGenT/CoGenT++ was not extended beyond 250 genomes, when the size of CoGenT++ reached 100 GB in 2006. By 2010, it was one of the few research group-level efforts to keep up with genome catalogs, an objective that is currently achieved only by operations such as those at the NCBI [26] and the EBI [27], with varying degrees of success. To achieve similar functionality, we chose to issue a set of utilities that allow users to recreate the CoGenT style of identifier encodings, named cogent_utils. These utilities use shell scripting with some awk and sed parts and read a catalog of genome encodings provided by the user. This action generates a directory with the collections of genomes adopting an encoding scheme comprising the species name, the version (starting with 01) and the incremental numbering of the gene list, so that each gene acquires a unique identifier. Collections can then be concatenated to obtain a full-size CoGenT-like database that may be subsequently indexed for BLAST [28] or DIAMOND [29,30] searches. The simple yet powerful schema allows the linking of genome sequences to other resources, also facilitated by MagicMatch. A snippet from the README file of cogent_utils is provided here, as an illustration.
Before: the first two sequence headers in the file GCF_008822105.2_bTaeGut2.pat.W.v2_protein.faa:
>NP_001041718.1 alpha-synuclein [Taeniopygia guttata]
>NP_001041719.1 neurocalcin-delta [Taeniopygia guttata]
After: the first two sequence headers in the generated file Taeg-2p1.faa in the destination folder:
>Taeg-2p1-01-000000 NP_001041718.1 alpha-synuclein [Taeniopygia guttata]
>Taeg-2p1-01-000001 NP_001041719.1 neurocalcin-delta [Taeniopygia guttata]
Once the CoGenT-style sequence collections are processed for database searches and high-throughput comparisons are executed, the resulting output may comprise alignments and/or a pairs-list. The pairs-list (e.g. option 6 for BLAST) is an ideal way of summarizing significant hits beyond a certain acceptable threshold value (minimal score or maximal E-value) and can be subjected to visualization and graph clustering. We have implemented a set of utilities as bash scripts, named clustt_utils, that capture the output files of large-scale sequence comparison and prepare them for visualization and clustering. For visualization, we primarily use BioLayout, originally developed by the CGG [31] and re-implemented in Java as BioLayout-java [32]. Later, this component was made available at biolayout.org and evolved into Graphia [33]. The pairs-list can also be used with other popular platforms such as Cytoscape [34]. The script clustt_utils generates pairs-list files as input for BioLayout or Cytoscape among others. These lists represent complex sequence similarity graphs that are also used for graph-clustering, where the resulting clusters are interpreted as protein families. Tribe-MCL [35] was the first fully automated approach and the second ever to generate clusters from sequence similarity graphs, a pivotal idea simultaneously proposed by the semi-automated COG system [36] around that time. The command line interface of clustt_utils takes as arguments the pairs-list (tabular output of sequence comparisons), the name of the output file, the inflation parameter and the path for MCL. This action creates three files: an output file for visualization, the MCL output and a human-readable output file with an incremental identifier for families and the sum of members per family for further processing.
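The two ends of this step, filtering a pairs-list and summarizing clusters into families, can be sketched in Python as follows. This is an illustration only and not the clustt_utils code: it assumes the standard 12-column BLAST tabular layout (E-value in the eleventh column) and a label-based MCL output with one cluster per line and whitespace-separated members. The E-value cutoff, the "FAM" prefix and the function names are hypothetical, and MCL itself is not invoked.

```python
def read_pairs(lines, max_evalue=1e-5):
    """Yield (query, subject, evalue) from BLAST tabular lines.

    Self-hits and hits above the E-value cutoff are skipped; the cutoff
    value is an illustrative assumption.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query != subject and evalue <= max_evalue:
            yield query, subject, evalue


def summarise_families(cluster_lines, prefix="FAM"):
    """Turn cluster lines into (family_id, member_count, members) tuples.

    Family identifiers are incremental; their exact format is hypothetical.
    """
    non_empty = (line for line in cluster_lines if line.strip())
    families = []
    for number, line in enumerate(non_empty, start=1):
        members = line.split()
        families.append((f"{prefix}{number:06d}", len(members), members))
    return families


# Toy input: three made-up hits and two made-up clusters.
blast_lines = [
    "a\tb\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200",
    "a\ta\t100.0\t100\t0\t0\t1\t100\t1\t100\t0.0\t210",
    "a\tc\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]
pairs = list(read_pairs(blast_lines))
families = summarise_families(["a\tb\tc", "d"])
print(pairs)
print(families)
```

In a full run, the filtered pairs would be written out as the graph that is handed to MCL with the chosen inflation parameter, and the cluster file it returns would be read by the second function.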
As a side-product of the collective effort to revive the CGG software, we also release tested versions of the GeneRAGE [37] and DifFuse [38] algorithms, initially implemented to detect multi-domain protein families and gene fusion events, respectively. GeneRAGE was reaching computational bottlenecks for multiple genomes around 2002, a fact that was the trigger for the exploration of other, less computationally demanding algorithms, inspiring early versions of BioLayout [31] and the subsequent adoption of graph-clustering with Tribe-MCL [35]. GeneRAGE builds a binary square matrix and validates non-symmetric relationships using the Smith-Waterman dynamic programming algorithm [19], by either removing false-positive hits or correcting false-negative instances [37]. DifFuse is an analogous implementation, with the difference that the matrix is not square but rectangular, where the shorter dimension represents the 'query' species for which gene fusion 'components' are requested and the longer dimension represents the 'reference' species from which gene fusion 'composites' are obtained [38-40]. The clustering results of GeneRAGE and Tribe-MCL can also be compared, as appropriate.
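The symmetrization step described for GeneRAGE can be sketched in Python as follows. This is a simplified illustration and not the released implementation: the Smith-Waterman re-alignment is replaced by a caller-supplied function (`is_similar`, a hypothetical name), and the identifiers and hits are made up.

```python
def build_matrix(ids, hits):
    """Build a binary square matrix from directed (query, subject) hits."""
    index = {name: i for i, name in enumerate(ids)}
    size = len(ids)
    matrix = [[0] * size for _ in range(size)]
    for query, subject in hits:
        matrix[index[query]][index[subject]] = 1
    return matrix


def asymmetric_pairs(matrix):
    """Return index pairs (i, j), i < j, where the hit exists in one direction only."""
    size = len(matrix)
    return [
        (i, j)
        for i in range(size)
        for j in range(i + 1, size)
        if matrix[i][j] != matrix[j][i]
    ]


def symmetrise(matrix, is_similar):
    """Resolve each asymmetric pair with a verification function.

    `is_similar(i, j)` stands in for a rigorous re-alignment: True fills in
    the missing hit (false negative), False removes the unsupported hit
    (false positive).
    """
    for i, j in asymmetric_pairs(matrix):
        value = 1 if is_similar(i, j) else 0
        matrix[i][j] = matrix[j][i] = value
    return matrix


# Toy data: p1 and p2 hit each other, p1 hits p3 but not vice versa.
ids = ["p1", "p2", "p3"]
hits = [("p1", "p2"), ("p2", "p1"), ("p1", "p3")]
matrix = build_matrix(ids, hits)
print(asymmetric_pairs(matrix))  # prints [(0, 2)]
print(symmetrise(matrix, lambda i, j: False))
```

Only the asymmetric pairs need the expensive re-alignment, which limits the number of rigorous comparisons compared with re-checking every hit.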

Results & discussion
The impact of these contributions can be documented directly from the literature, with more than 6000 citations for the tools (6284 on 28-Mar-2023) and an additional 1000 (1125 on 28-Mar-2023) citations for other research by the CGG that explicitly used these tools during its existence (Table 1). With an average of ~24 citations/year for 20 years each, this amounts to an average of 480 citations per publication, with significant deviations (Tribe-MCL as the most highly cited and MagicMatch and CoGenT++ the least cited, perhaps due to their shorter lifespan and subsequent non-availability). Some of the citing references are heavily cited as well, e.g. OrthoMCL [41] or Roary [42].
We hope that by making these components accessible again, the expert community will appreciate their merit. We also note that all software can be used without CoGenT identifiers; however, to realize the full power of the suite, it is recommended that CoGenT identifiers are generated. We kindly request that third-party efforts deploying CoGenT-style database creations and comparisons also cite the original papers accordingly.
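For readers who wish to see what generating CoGenT-style identifiers involves, the following Python sketch re-encodes FASTA headers following the pattern shown in the README snippet (species code, two-digit version, six-digit incremental number starting from zero, followed by the original header). It is an illustration only and not the cogent_utils scripts, which are written in shell with awk and sed; the function name is hypothetical and the sequence lines are placeholders.

```python
def encode_headers(fasta_lines, species_code, version=1, width=6):
    """Yield FASTA lines with CoGenT-style identifiers prepended to headers.

    The pattern <species_code>-<version>-<counter> mirrors the README
    example; the default field widths are inferred from that example.
    """
    counter = 0
    for line in fasta_lines:
        if line.startswith(">"):
            original = line[1:].strip()
            yield f">{species_code}-{version:02d}-{counter:0{width}d} {original}"
            counter += 1
        else:
            # Sequence lines pass through unchanged.
            yield line.rstrip("\n")


# Headers taken from the README example; sequence lines are placeholders.
records = [
    ">NP_001041718.1 alpha-synuclein [Taeniopygia guttata]",
    "MDVF",
    ">NP_001041719.1 neurocalcin-delta [Taeniopygia guttata]",
    "MGKQ",
]
encoded = list(encode_headers(records, "Taeg-2p1"))
print("\n".join(encoded))
```

Keeping the original accession and description after the new identifier preserves the link to the source record, while the fixed-width prefix makes entries easy to sort and parse.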
The tools, source code and usage instructions are available on BCPL's bcpl-certh GitHub repository, which can be found at https://github.com/bcpl-certh/cgg-toolkit. BioLayout can be downloaded from biolayout.org; it can perform a number of intense computations, including Tribe-MCL types of clustering, and can further be used for functional genomics and other visualization activities [43]. All other tools are terminal-based and require a command shell, preferably bash. The main advantage of bash is its cross-platform support and the ease and flexibility to design custom behavior. By utilizing custom bash programming, users are able to reconfigure the current toolkit, automate it and extend it according to their needs.
The suite of tools presented herein facilitates large-scale genomic comparisons with attested quality, reproducibility, efficiency and scalability. All the above software was developed with a minimalist approach in mind and modest funding resources. Yet, it has been proven to be a valuable arsenal for the development and application of key ideas in genome bioinformatics that supported our own and multiple other research efforts. We hope that the community will embrace these tools and find novel, creative ways of using them.

Fig 1. Revived software tools. A 2008 snapshot of the 'Key software' section of the CGG website followed by services (partly shown), with the list of tools made available again. https://doi.org/10.1371/journal.pcbi.1011498.g001

Fig 2. Representation of a typical workflow using the reported tools. Pre-processing may start with a genome collection (database symbol, upper left), optionally mixed with a curated sequence resource such as UniProt (database symbol in green, upper left). To cross-index entries at the sequence level or simply identify them, MagicMatch can be used as an option. The sequence collection can be submitted to GeneCAST to mask compositional bias and prepare the query for sensitive searches (disk symbol with Q, lower left). For genome-scale analysis, species codes can be generated for the reference (target) set with cogent_utils, to create a uniformly named sequence set (disk symbol with R, lower middle, optionally mixed with UniProt or any other annotated collection). Sequence comparisons are executed with BLAST or other options with query Q vs. reference R (or in the case of all-vs-all, disk symbol in green-blue gradient, upper middle). The vertical gray line divides this pre-processing phase from the next phase, signifying the computationally intensive step or long wall-time. Two (non-mutually exclusive) output alternatives are shown: the pairs-list (in pink, upper right) or full alignments (also in pink, lower right). The former can be treated with clustt_utils that launches Tribe-MCL and generates protein families or can be used as input for network visualization with BioLayout or other similar software, while the latter can be further processed for GeneRAGE or DifFuse for multi-domain or gene-fusion detection, respectively, as well as for inspection and parsing for multiple alignments. https://doi.org/10.1371/journal.pcbi.1011498.g002

Table 1. A list of the tools presented and selected additional work that benefited from them.
Columns: GitHub: name of the GitHub repository where the tools and documentation are available (NA: not applicable, as case study), where the prefix of the GitHub folders implies a typical workflow (outlined in Fig 2); tool: tool name (or in case of studies, a codeword); year: year of original publication; PMID: PubMed identifier; citations: number of citations reported by Google Scholar on 28-Mar-2023; citations/yr: number of citations per year since original publication; short description: self-explanatory, for further details, please see original publications. The table is sorted on PMID (which reflects the time of publication). https://doi.org/10.1371/journal.pcbi.1011498.t001